User Performance with Continuous Speech Recognition Systems
Abstract
The University of Michigan Rehabilitation Engineering and Research Center (UM RERC) on Ergonomics is just beginning a three-year study on user performance with continuous speech recognition systems. The application of speech recognition to the computer access needs of people with disabilities continues to grow, and a greater understanding of user performance with such systems is needed. This paper outlines what is known about user performance with speech recognition systems and presents the plan of a project designed to enhance understanding in this area.

STATEMENT OF THE PROBLEM

Continuous speech recognition (CSR) systems have the potential to greatly improve the productivity and comfort of performing computer-based tasks for a wide variety of users. These systems allow data input into a computer simply by speaking into a microphone, without requiring the speaker to pause between each word. For users whose severe physical disabilities require them to have some sort of hands-free access to a computer, CSR is an attractive option compared to potentially less efficient methods such as mouthstick typing or two-switch Morse code. For users whose use of "standard" manual input methods has led to a repetitive stress injury or other serious biomechanical stress, CSR may provide a productive alternative to continued discomfort and exacerbation of the injury, freeing users from keyboard use and its associated postural constraints.

While the promise of CSR is enormous and sales of voice recognition systems continue to grow, some basic questions regarding user performance with voice recognition have not been satisfactorily addressed. These include:

1. What is the range of productivity that a user of a CSR system can expect? How does this depend on the characteristics of both the user and the task?
2. What is the learning curve associated with CSR systems? How long does it take to develop a high degree of proficiency?
3. Are there human factors costs that may partially counteract the benefits of using CSR systems?
4. If so, are there methods of assessing for and delivering CSR systems that can reduce the impact of these costs and result in improved user satisfaction and productivity?

This study will provide a fuller understanding of the role of CSR systems in meeting the needs of people with disabilities by addressing these questions. The project will apply this new understanding to devise and evaluate methods of improving user performance with CSR systems.

BACKGROUND

Voice Recognition (VR) Systems

Automatic voice recognition (VR) has been under development since the early 1970s. Early systems could recognize only a handful of discrete words or utterances. By the late 1980s, recognition vocabularies of several thousand words became available, with the requirement that the user speak each word consistently and discretely, with short pauses between words. Discrete word VR systems have continued to improve in vocabulary size and recognition accuracy. In 1997 a major breakthrough in VR technology occurred with the first consumer-affordable continuous speech recognition system. Continuous speech allows users to speak at their natural pace and rhythm, resulting in faster and potentially more satisfactory interaction.

User-System Performance for Discrete VR Systems

The vast majority of existing literature on user performance with VR deals with discrete systems only. One metric of user-system performance is the recognition rate, measured as the percent of words accurately recognized. One early system, with a limited vocabulary of 70 utterances, was able to recognize up to 90% accurately for a well-trained able-bodied subject (Dabbagh and Damper, 1985). More advanced systems, with vocabularies of several thousand words, have reported recognition rates of 94% to 98% for well-trained subjects with and without severe upper extremity disabilities (Karl et al., 1993; Dalton et al., 1997).
This is comparable to the accuracy of a skilled typist or mouthstick user (Dalton et al., 1997).

A second performance metric is overall user productivity. Discrete VR systems may or may not provide improved performance relative to standard input methods, depending on the task and subject population. For example, Karl et al. (1993) observed that when able-bodied subjects used voice instead of a mouse to enter word processing commands, time for four specific tasks was reduced by 19%. In a similar experiment for spreadsheet tasks, however, subjects performed more slowly with the voice interface (Molnar, 1996). Zemmel (1996a, 1996b) concluded that discrete VR was inadequate for medical emergency room and radiology dictation based on observed performance in those environments.

For general dictation and text entry, which is an important VR application for users with disabilities, performance with discrete VR systems has steadily improved over the years. For one early system, in which the user spelled out each word using the military alphabet, text entry rates of approximately 8 words per minute (wpm) were achieved (Dabbagh and Damper, 1985). By 1997, rates for highly skilled able-bodied users approached 25-30 wpm (Mello, 1997), not generally competitive with skilled touch typists but perhaps sufficiently fast for certain workplace dictation tasks. There are very few reports directly comparing text entry rate with VR to other input methods for users with physical disabilities. In one single-case study, a well-trained user with a high-level spinal cord injury achieved 20 wpm with a discrete VR system, as compared to 13 wpm using his mouthstick on a standard keyboard (Dalton et al., 1997).

Human Factors Issues in Discrete VR Systems

Human factors issues with discrete VR systems are an important influence on user performance.
While several such issues have been mentioned in the literature, including learning/training, other cognitive and perceptual aspects of interacting with a VR system, the capacity of the human vocal system, and the task environment, there are very few specific reports examining their quantitative impact on user-system performance.

Learning and training is one of the most frequently mentioned issues (e.g., Horner et al., 1993; Biermann et al., 1992). For successful use of a discrete VR system, the system must learn how the user speaks, which typically involves a standard enrollment process in which the user says specific words in response to system prompts. The user must learn how to speak in such a way as to maximize recognition accuracy, by using a consistent tone of voice and the proper pause between each word. The user must also learn the most effective technique for correcting the system when the inevitable misrecognition occurs.

The time and effort involved in learning effective use of a discrete VR system, as well as in repeatedly deciding on the optimal correction strategy for each misrecognition, are examples of the "cognitive cost" of using the system. Other examples are the conscious effort required to speak each word discretely and then attend to the system's recognition response. This response typically takes the form of a "pick list" of candidate words that potentially match the user's utterance; the user has the option of visually searching this list for the correct word and choosing it verbally. The presence of these cognitive activities is what primarily distinguishes use of discrete VR from speaking naturally in a conversation. The need to frequently engage in them during human-computer interaction can be both tiring and time-consuming to the user (Card, Moran, and Newell, 1983; Koester and Levine, 1996).
For example, in a clinical case study the time involved in correcting misrecognized words accounted for more than 50% of the task time (Koester and Hilker, 1995, unpublished).

There has also been some suggestion in the literature that use of voice recognition can have unanticipated physical consequences. While decreasing the biomechanical load on upper extremities and postural systems, discrete VR can exact a greater load on the vocal system. This may cause only minor discomfort for some, but Kambeyanda et al. (1997) report on four individuals who developed chronic vocal stress requiring treatment after one year of using a discrete VR system.

Finally, the conditions of the work environment in which VR is used can have a significant impact on user performance. Key characteristics include placement and stability of the microphone, workplace background noise, and the extent to which VR use disturbs others in the environment. Zemmel et al. (1996a, 1996b) found VR not suitable for hospital emergency room or radiology environments due to background noise and other environmental issues.

Subjective comments of discrete VR users corroborate the presence of some significant human factors issues. Even users who enjoy using VR overall have commented on short-term memory challenges and consequent interference with the task domain, the "tedious" nature of talking all the time, voice fatigue, and the frustration of attending to and correcting errors (Biermann et al., 1992).

User-System Performance for Continuous VR Systems

Continuous speech recognition (CSR) systems that recognize tens of thousands of words are now available for less than a few hundred dollars. Popular reviews of such systems suggest that users can employ natural speech at their natural pace, with resulting dictation speeds of up to 100 wpm and 95% recognition accuracy (Mello, 1997; O'Malley, 1997).
However, we have found no empirical validation of these claims in the literature, either for "mainstream" or physically disabled users. While the ability to use natural speech at a natural pace would be expected to reduce the impact of human factors issues on performance with CSR, many of the cognitive and perceptual activities required for interaction with a discrete VR system are still present with continuous speech recognition. In particular, misrecognition errors still occur and need to be identified and corrected. This process is in fact somewhat more complicated than with discrete VR, since there are more choices for when to check for errors and how to correct them. Effective interaction still requires development of a mental model of how the system works, an understanding of which error correction strategy is best suited to a particular situation, shared attention between the task domain and the output of the CSR system, and memorization of specific commands for executing the chosen error correction strategy. To date there have been no reports of how these activities impact user performance and satisfaction, or of how to design effective training interventions to reduce any negative impact they may have.
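The interaction between dictation speed, recognition accuracy, and correction effort described above can be made concrete with a simple back-of-the-envelope model. The sketch below is not from the study itself; it is a hypothetical illustration in which the raw speed, accuracy, and per-error correction time are all assumed values, chosen to echo the vendor-style figures (100 wpm, 95% accuracy) and the clinical observation that correction can consume a large share of task time.

```python
def effective_wpm(raw_wpm, accuracy, correction_secs_per_error):
    """Effective text entry rate once error correction time is included.

    raw_wpm -- dictation speed while actually speaking (words per minute)
    accuracy -- fraction of words recognized correctly (0..1)
    correction_secs_per_error -- assumed average time to fix one misrecognition
    """
    # Time to dictate one word, in minutes.
    minutes_per_word = 1.0 / raw_wpm
    # Expected correction time attributable to each dictated word:
    # (1 - accuracy) errors per word, each costing correction_secs_per_error.
    correction_minutes_per_word = (1.0 - accuracy) * correction_secs_per_error / 60.0
    return 1.0 / (minutes_per_word + correction_minutes_per_word)

# Vendor-style claim of 100 wpm at 95% accuracy, with an assumed
# 10 seconds to locate and fix each misrecognized word.
rate = effective_wpm(100, 0.95, 10)
print(f"Effective rate: {rate:.1f} wpm")  # prints: Effective rate: 54.5 wpm
```

Under these assumed figures, even a modest 10 seconds per correction cuts effective throughput to roughly half the raw dictation speed, with correction accounting for close to half of total task time; this is broadly consistent with the clinical case study cited above, in which correcting misrecognized words consumed more than 50% of task time.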